{
 "cells": [
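  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's work through a hands-on exercise to put what we have learned into practice. This time we will scrape some listings from Lianjia (链家网), again using the familiar requests and BeautifulSoup."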
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Preparation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before writing the scraper, we import the libraries we need: mainly requests and BeautifulSoup, plus the time module, which sets the pause between fetches."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "import time\n",
    "from bs4 import BeautifulSoup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Scraping the list pages"
   ]
  },
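  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before scraping, we should understand the URL structure of the target site.<br>\n",
    "Lianjia's second-hand housing channel has 100 list pages, with URLs of the form [http://bj.lianjia.com/ershoufang/pg9/](http://bj.lianjia.com/ershoufang/pg9/), where\n",
    "\n",
    "bj is the city<br>\n",
    "/ershoufang/ is the channel name<br>\n",
    "pg9 is the page number.\n",
    "\n",
    "We want the Beijing second-hand housing channel, so the front part of the URL stays fixed, while the page number at the end varies from 1 to 100. We split the URL in two: the fixed part is assigned to url, and the variable part is generated by iterating over the pages with a for loop."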
   ]
  },
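  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of the fixed-plus-variable URL scheme described above: the helper name `build_page_urls` is hypothetical, while the `pg` prefix and the 1-100 page range come from the URL structure just shown.\n",
    "\n",
    "```python\n",
    "BASE_URL = 'http://bj.lianjia.com/ershoufang/'\n",
    "\n",
    "\n",
    "def build_page_urls(first_page, last_page, base_url=BASE_URL):\n",
    "    # Hypothetical helper: append 'pg<number>/' to the fixed part for each page.\n",
    "    return [base_url + 'pg' + str(n) + '/' for n in range(first_page, last_page + 1)]\n",
    "\n",
    "\n",
    "urls = build_page_urls(1, 100)\n",
    "print(urls[8])  # http://bj.lianjia.com/ershoufang/pg9/\n",
    "```\n",
    "\n",
    "Building the whole list up front makes it easy to check the URLs by eye before any request is sent."
   ]
  },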
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fixed part of the list-page URL\n",
    "url = 'http://bj.lianjia.com/ershoufang/'\n",
    "# Prefix of the variable part; the page number is appended to it (pg1, pg2, ...)\n",
    "page = 'pg'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "str"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(page)"
   ]
  },
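  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A small reminder: it is best to set headers on the HTTP request, otherwise your IP can easily get blocked. Ready-made header sets are widely available online, or you can inspect the ones your browser sends with a tool such as HttpWatch."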
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Request headers\n",
    "headers = {\n",
    "    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',\n",
    "    'Accept': 'text/html;q=0.9,*/*;q=0.8',\n",
    "    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',\n",
    "    'Accept-Encoding': 'gzip',\n",
    "    'Connection': 'close',\n",
    "    'Referer': 'http://www.baidu.com/link?url=_andhfsjjjKRgEWkj7i9cFmYYGsisrnm2A-TN3XZDQXxvGsM9k9ZZSnikW2Yds4s&wd=&eqid=c3435a7d00146bd600000003582bfd1f'\n",
    "}"
   ]
  },
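  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use a for loop to generate the page numbers, convert each one to a string, and concatenate it with the fixed part of the URL to form the address to fetch. The code pauses 3 seconds between pages. As written, the scraping loop uses range(1, 2) and therefore fetches only the first page; change it to range(1, 101) to fetch all 100. The fetched pages are stored in html."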
   ]
  },
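  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The loop described above can also be sketched as a small reusable function. This is only an illustration under stated assumptions: `fetch_pages` and its `fetch` argument are hypothetical names, and the demonstration uses a fake fetcher so that no request is actually sent.\n",
    "\n",
    "```python\n",
    "import time\n",
    "\n",
    "\n",
    "def fetch_pages(urls, fetch, delay_seconds=3):\n",
    "    # 'fetch' is any callable that takes a URL and returns the page bytes,\n",
    "    # e.g. lambda u: requests.get(u, headers=headers).content (an assumption).\n",
    "    pages = []\n",
    "    for position, page_url in enumerate(urls):\n",
    "        if position > 0:\n",
    "            # Pause between requests, but not before the first one.\n",
    "            time.sleep(delay_seconds)\n",
    "        pages.append(fetch(page_url))\n",
    "    # Join all pages into one bytes object, like the html variable in this notebook.\n",
    "    return b''.join(pages)\n",
    "\n",
    "\n",
    "# Offline demonstration with a fake fetcher that just echoes the URL as bytes.\n",
    "demo = fetch_pages(['pg1/', 'pg2/'], lambda u: u.encode('utf-8'), delay_seconds=0)\n",
    "print(demo)\n",
    "```\n",
    "\n",
    "Passing the fetcher in as an argument keeps the pacing logic testable without touching the network."
   ]
  },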
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n",
      "奇数\n",
      "偶数\n"
     ]
    }
   ],
   "source": [
    "for i in range(1, 101):\n",
    "    if i % 2 == 1:\n",
    "        print('奇数')  # odd\n",
    "    else:\n",
    "        print('偶数')  # even\n",
    "    # Pause 3 seconds between iterations\n",
    "    time.sleep(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fetch the list pages in a loop\n",
    "html = b''\n",
    "for i in range(1, 2):\n",
    "    # Build the page URL from the fixed part, the page prefix and the page number\n",
    "    a = url + page + str(i) + '/'\n",
    "    # List pages are retrieved with GET\n",
    "    r = requests.get(url=a, headers=headers)\n",
    "    # Append this page's raw bytes to what has been fetched so far\n",
    "    html = html + r.content\n",
    "    # Pause 3 seconds between requests\n",
    "    time.sleep(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Parsing the pages"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The fetching is done and the content is in html. The next step is to parse it, and we again use BeautifulSoup for that."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Parse the fetched page content\n",
    "lj = BeautifulSoup(html,'html.parser')"
   ]
  },
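  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the page is parsed, we can extract the key information. Below we extract three parts of each listing: the total price, the listing details, and the follower information. We pull out the div tags with class=totalPrice and use a for loop to store each listing's total price in tp."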
   ]
  },
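  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see the class-based lookup in isolation, here is a small sketch on a hand-written snippet. The markup in `sample` is a simplified assumption, not the real Lianjia page, which has more attributes and nesting.\n",
    "\n",
    "```python\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "# Hypothetical, simplified markup standing in for two listings.\n",
    "sample = (\n",
    "    '<div class=totalPrice><span>458</span></div>'\n",
    "    '<div class=totalPrice><span>820</span></div>'\n",
    ")\n",
    "soup = BeautifulSoup(sample, 'html.parser')\n",
    "\n",
    "# A string as the second argument of find_all is matched against the CSS class.\n",
    "prices = [div.span.get_text() for div in soup.find_all('div', 'totalPrice')]\n",
    "print(prices)\n",
    "```\n",
    "\n",
    "The values come back as strings, so they still need converting before any arithmetic."
   ]
  },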
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract the total price of each listing\n",
    "price = lj.find_all('div','totalPrice')\n",
    "tp = []\n",
    "for a in price:\n",
    "    totalPrice = a.span.string\n",
    "    tp.append(totalPrice)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[u'458',\n",
       " u'820',\n",
       " u'591',\n",
       " u'370',\n",
       " u'425',\n",
       " u'509',\n",
       " u'265',\n",
       " u'475',\n",
       " u'483',\n",
       " u'425',\n",
       " u'470',\n",
       " u'1145',\n",
       " u'970',\n",
       " u'460',\n",
       " u'599',\n",
       " u'658',\n",
       " u'505',\n",
       " u'435',\n",
       " u'415',\n",
       " u'640',\n",
       " u'337',\n",
       " u'570',\n",
       " u'280',\n",
       " u'500',\n",
       " u'425',\n",
       " u'348',\n",
       " u'359',\n",
       " u'309',\n",
       " u'980',\n",
       " u'230']"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tp"
   ]
  },
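  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The listing details and follower information are extracted in much the same way as the price. The code is below: the listing details are stored in hi and the follower information in fi."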
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract the listing details\n",
    "houseInfo = lj.find_all('div',attrs={'class':'houseInfo'})\n",
    "\n",
    "hi = []\n",
    "for b in houseInfo:\n",
    "    house = b.get_text()\n",
    "    hi.append(house)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "风雅园二区 /2室1厅/94.2平米/南 北/毛坯/无电梯\n",
      "雅世合金公寓 /2室1厅/90.28平米/南 北/精装/有电梯\n",
      "龙腾苑四区 /3室1厅/128.08平米/东南 南 北/精装/无电梯\n",
      "东泽园 /2室1厅/74.59平米/东南/简装/有电梯\n",
      "育芳园 /2室1厅/78.6平米/南 北/精装/无电梯\n",
      "三环新城7号院 /2室1厅/96.95平米/南 北/精装/有电梯\n",
      "惠泽家园 /2室1厅/90.22平米/南 北/简装/无电梯\n",
      "北店嘉园 /2室1厅/88.14平米/南 北/简装/无电梯\n",
      "龙泽苑西区 /2室1厅/96.24平米/东南 北/简装/无电梯\n",
      "佳运园二期 /2室1厅/88.68平米/南 北/简装/无电梯\n",
      "新龙城 /2室1厅/86.76平米/东北/其他/有电梯\n",
      "大成郡 /3室2厅/136.6平米/南 西/精装\n",
      "奥林匹克花园三期 /4室2厅/151平米/南 北/精装/有电梯\n",
      "风雅园一区 /3室1厅/115.32平米/南 北/其他/无电梯\n",
      "新龙城 /3室1厅/113.19平米/南 北/精装/有电梯\n",
      "翡翠城五期 /3室2厅/114.44平米/南 北/精装/有电梯\n",
      "鸿业兴园一区 /2室1厅/88.78平米/西南/精装/有电梯\n",
      "龙锦苑东五区 /2室1厅/99.59平米/南 北/精装/无电梯\n",
      "马家堡67号院 /2室1厅/78.05平米/南/简装/有电梯\n",
      "农科院 /3室0厅/57.8平米/南 北/精装/无电梯\n",
      "龙博苑三区 /1室1厅/58.91平米/南/简装/无电梯\n",
      "都城心屿 /2室1厅/77.29平米/东/精装/有电梯\n",
      "中海御鑫阁 /1室0厅/40.86平米/东 南/精装/有电梯\n",
      "郁花园一里 /2室2厅/111.49平米/南 北/精装/有电梯\n",
      "朝阳新城六区 /2室1厅/84.62平米/南/简装/有电梯\n",
      "韩庄子三里 /2室1厅/62.9平米/南 北/简装/无电梯\n",
      "沸城 /2室1厅/85.44平米/南/简装/有电梯\n",
      "天下儒寓 /1室1厅/68.75平米/南/精装/有电梯\n",
      "奥林匹克花园一期 /5室2厅/216.91平米/南 北/简装/无电梯\n",
      "中海御鑫阁 /1室0厅/30.92平米/西/精装/有电梯\n"
     ]
    }
   ],
   "source": [
    "for item in hi:\n",
    "    print(item)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract the follower and viewing information\n",
    "followInfo=lj.find_all('div',attrs={'class':'followInfo'})\n",
    "\n",
    "fi = []\n",
    "for c in followInfo:\n",
    "    follow=c.get_text()\n",
    "    fi.append(follow)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "19人关注/24次带看近地铁VR房源房本满五年随时看房458万单价48620元/平米\n",
      "110人关注/50次带看VR房源房本满五年随时看房820万单价90829元/平米\n",
      "46人关注/53次带看近地铁VR房源房本满五年随时看房591万单价46144元/平米\n",
      "57人关注/54次带看VR房源房本满五年随时看房370万单价49605元/平米\n",
      "134人关注/29次带看近地铁VR房源房本满五年随时看房425万单价54072元/平米\n",
      "113人关注/60次带看近地铁VR房源房本满五年随时看房509万单价52502元/平米\n",
      "77人关注/24次带看VR房源房本满五年随时看房265万单价29373元/平米\n",
      "12人关注/22次带看近地铁VR房源房本满五年475万单价53892元/平米\n",
      "41人关注/31次带看近地铁VR房源房本满五年随时看房483万单价50188元/平米\n",
      "22人关注/28次带看VR房源房本满五年随时看房425万单价47926元/平米\n",
      "37人关注/51次带看近地铁VR房源房本满两年随时看房470万单价54173元/平米\n",
      "109人关注/10次带看VR房源房本满五年随时看房1145万单价83822元/平米\n",
      "146人关注/36次带看VR房源房本满五年随时看房970万单价64239元/平米\n",
      "38人关注/22次带看近地铁VR房源房本满五年随时看房460万单价39890元/平米\n",
      "22940人关注/25次带看近地铁VR房源房本满五年随时看房599万单价52920元/平米\n",
      "99人关注/28次带看VR房源房本满五年随时看房658万单价57498元/平米\n",
      "112人关注/56次带看VR房源房本满五年505万单价56883元/平米\n",
      "42人关注/41次带看VR房源房本满五年435万单价43680元/平米\n",
      "205人关注/27次带看近地铁VR房源房本满五年415万单价53172元/平米\n",
      "100人关注/45次带看近地铁VR房源房本满五年随时看房640万单价110727元/平米\n",
      "20人关注/88次带看近地铁VR房源房本满五年随时看房337万单价57206元/平米\n",
      "21人关注/48次带看近地铁VR房源房本满五年随时看房570万单价73749元/平米\n",
      "66人关注/32次带看近地铁VR房源房本满五年随时看房280万单价68527元/平米\n",
      "14人关注/21次带看近地铁VR房源房本满五年随时看房500万单价44848元/平米\n",
      "31人关注/24次带看VR房源房本满五年随时看房425万单价50225元/平米\n",
      "33人关注/23次带看近地铁VR房源房本满五年随时看房348万单价55326元/平米\n",
      "88人关注/30次带看VR房源房本满五年359万单价42018元/平米\n",
      "16人关注/21次带看近地铁VR房源房本满五年随时看房309万单价44946元/平米\n",
      "55人关注/21次带看VR房源房本满五年随时看房980万单价45181元/平米\n",
      "18人关注/28次带看近地铁VR房源房本满两年随时看房230万单价74386元/平米\n"
     ]
    }
   ],
   "source": [
    "for item in fi:\n",
    "    print(item)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Cleaning the data and organizing it into a table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>followinfo</th>\n",
       "      <th>houseinfo</th>\n",
       "      <th>totalprice</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>19人关注/24次带看近地铁VR房源房本满五年随时看房458万单价48620元/平米</td>\n",
       "      <td>风雅园二区 /2室1厅/94.2平米/南 北/毛坯/无电梯</td>\n",
       "      <td>458</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>110人关注/50次带看VR房源房本满五年随时看房820万单价90829元/平米</td>\n",
       "      <td>雅世合金公寓 /2室1厅/90.28平米/南 北/精装/有电梯</td>\n",
       "      <td>820</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>46人关注/53次带看近地铁VR房源房本满五年随时看房591万单价46144元/平米</td>\n",
       "      <td>龙腾苑四区 /3室1厅/128.08平米/东南 南 北/精装/无电梯</td>\n",
       "      <td>591</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>57人关注/54次带看VR房源房本满五年随时看房370万单价49605元/平米</td>\n",
       "      <td>东泽园 /2室1厅/74.59平米/东南/简装/有电梯</td>\n",
       "      <td>370</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>134人关注/29次带看近地铁VR房源房本满五年随时看房425万单价54072元/平米</td>\n",
       "      <td>育芳园 /2室1厅/78.6平米/南 北/精装/无电梯</td>\n",
       "      <td>425</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                    followinfo  \\\n",
       "0   19人关注/24次带看近地铁VR房源房本满五年随时看房458万单价48620元/平米   \n",
       "1     110人关注/50次带看VR房源房本满五年随时看房820万单价90829元/平米   \n",
       "2   46人关注/53次带看近地铁VR房源房本满五年随时看房591万单价46144元/平米   \n",
       "3      57人关注/54次带看VR房源房本满五年随时看房370万单价49605元/平米   \n",
       "4  134人关注/29次带看近地铁VR房源房本满五年随时看房425万单价54072元/平米   \n",
       "\n",
       "                            houseinfo totalprice  \n",
       "0       风雅园二区 /2室1厅/94.2平米/南 北/毛坯/无电梯        458  \n",
       "1     雅世合金公寓 /2室1厅/90.28平米/南 北/精装/有电梯        820  \n",
       "2  龙腾苑四区 /3室1厅/128.08平米/东南 南 北/精装/无电梯        591  \n",
       "3         东泽园 /2室1厅/74.59平米/东南/简装/有电梯        370  \n",
       "4         育芳园 /2室1厅/78.6平米/南 北/精装/无电梯        425  "
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Import the pandas library\n",
    "import pandas as pd\n",
    "# Build the data table\n",
    "house = pd.DataFrame({'totalprice':tp,'houseinfo':hi,'followinfo':fi})\n",
    "# Preview the first rows of the table\n",
    "house.head()"
   ]
  },
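  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Awkwardly, as you can see, much of the information is mashed together and cannot be used directly, so we need some further extraction and cleaning. Take the listing details: the community name, layout, area, orientation and so on all sit in a single field. We first need to split that field into columns. The rule here is clear: the items are separated by slashes (/), so we simply split on the slash."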
   ]
  },
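  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One detail worth sketching is that not every listing has all six fields (one row printed earlier stops after the decoration field). The helper below is a hypothetical illustration in plain Python, not the method used in this notebook: it splits on the slash, trims stray spaces, and pads short rows.\n",
    "\n",
    "```python\n",
    "FIELDS = ['xiaoqu', 'huxing', 'mianji', 'chaoyang', 'zhuangxiu', 'dianti']\n",
    "\n",
    "\n",
    "def split_house_info(text, n_fields=len(FIELDS)):\n",
    "    # Split on '/', trim surrounding spaces, and keep at most n_fields parts.\n",
    "    parts = [p.strip() for p in text.split('/')][:n_fields]\n",
    "    # Pad short rows with None so every row has the same length.\n",
    "    return parts + [None] * (n_fields - len(parts))\n",
    "\n",
    "\n",
    "row = split_house_info(u'大成郡 /3室2厅/136.6平米/南 西/精装')\n",
    "print(len(row))  # 6\n",
    "```\n",
    "\n",
    "Trimming also removes the trailing space that the raw text leaves after the community name."
   ]
  },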
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Split the listing details into columns\n",
    "houseinfo_split = pd.DataFrame((x.split('/') for x in house.houseinfo),index = house.index,\n",
    "                              columns=['xiaoqu','huxing','mianji','chaoyang','zhuangxiu','dianti'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>xiaoqu</th>\n",
       "      <th>huxing</th>\n",
       "      <th>mianji</th>\n",
       "      <th>chaoyang</th>\n",
       "      <th>zhuangxiu</th>\n",
       "      <th>dianti</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>风雅园二区</td>\n",
       "      <td>2室1厅</td>\n",
       "      <td>94.2平米</td>\n",
       "      <td>南 北</td>\n",
       "      <td>毛坯</td>\n",
       "      <td>无电梯</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>雅世合金公寓</td>\n",
       "      <td>2室1厅</td>\n",
       "      <td>90.28平米</td>\n",
       "      <td>南 北</td>\n",
       "      <td>精装</td>\n",
       "      <td>有电梯</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>龙腾苑四区</td>\n",
       "      <td>3室1厅</td>\n",
       "      <td>128.08平米</td>\n",
       "      <td>东南 南 北</td>\n",
       "      <td>精装</td>\n",
       "      <td>无电梯</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>东泽园</td>\n",
       "      <td>2室1厅</td>\n",
       "      <td>74.59平米</td>\n",
       "      <td>东南</td>\n",
       "      <td>简装</td>\n",
       "      <td>有电梯</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>育芳园</td>\n",
       "      <td>2室1厅</td>\n",
       "      <td>78.6平米</td>\n",
       "      <td>南 北</td>\n",
       "      <td>精装</td>\n",
       "      <td>无电梯</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    xiaoqu huxing    mianji chaoyang zhuangxiu dianti\n",
       "0   风雅园二区    2室1厅    94.2平米      南 北        毛坯    无电梯\n",
       "1  雅世合金公寓    2室1厅   90.28平米      南 北        精装    有电梯\n",
       "2   龙腾苑四区    3室1厅  128.08平米   东南 南 北        精装    无电梯\n",
       "3     东泽园    2室1厅   74.59平米       东南        简装    有电梯\n",
       "4     育芳园    2室1厅    78.6平米      南 北        精装    无电梯"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Preview the split result\n",
    "houseinfo_split.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Merge the split columns back into the original table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "house = pd.merge(house,houseinfo_split,right_index=True, left_index=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>followinfo</th>\n",
       "      <th>houseinfo</th>\n",
       "      <th>totalprice</th>\n",
       "      <th>xiaoqu</th>\n",
       "      <th>huxing</th>\n",
       "      <th>mianji</th>\n",
       "      <th>chaoyang</th>\n",
       "      <th>zhuangxiu</th>\n",
       "      <th>dianti</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>19人关注/24次带看近地铁VR房源房本满五年随时看房458万单价48620元/平米</td>\n",
       "      <td>风雅园二区 /2室1厅/94.2平米/南 北/毛坯/无电梯</td>\n",
       "      <td>458</td>\n",
       "      <td>风雅园二区</td>\n",
       "      <td>2室1厅</td>\n",
       "      <td>94.2平米</td>\n",
       "      <td>南 北</td>\n",
       "      <td>毛坯</td>\n",
       "      <td>无电梯</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>110人关注/50次带看VR房源房本满五年随时看房820万单价90829元/平米</td>\n",
       "      <td>雅世合金公寓 /2室1厅/90.28平米/南 北/精装/有电梯</td>\n",
       "      <td>820</td>\n",
       "      <td>雅世合金公寓</td>\n",
       "      <td>2室1厅</td>\n",
       "      <td>90.28平米</td>\n",
       "      <td>南 北</td>\n",
       "      <td>精装</td>\n",
       "      <td>有电梯</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>46人关注/53次带看近地铁VR房源房本满五年随时看房591万单价46144元/平米</td>\n",
       "      <td>龙腾苑四区 /3室1厅/128.08平米/东南 南 北/精装/无电梯</td>\n",
       "      <td>591</td>\n",
       "      <td>龙腾苑四区</td>\n",
       "      <td>3室1厅</td>\n",
       "      <td>128.08平米</td>\n",
       "      <td>东南 南 北</td>\n",
       "      <td>精装</td>\n",
       "      <td>无电梯</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>57人关注/54次带看VR房源房本满五年随时看房370万单价49605元/平米</td>\n",
       "      <td>东泽园 /2室1厅/74.59平米/东南/简装/有电梯</td>\n",
       "      <td>370</td>\n",
       "      <td>东泽园</td>\n",
       "      <td>2室1厅</td>\n",
       "      <td>74.59平米</td>\n",
       "      <td>东南</td>\n",
       "      <td>简装</td>\n",
       "      <td>有电梯</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>134人关注/29次带看近地铁VR房源房本满五年随时看房425万单价54072元/平米</td>\n",
       "      <td>育芳园 /2室1厅/78.6平米/南 北/精装/无电梯</td>\n",
       "      <td>425</td>\n",
       "      <td>育芳园</td>\n",
       "      <td>2室1厅</td>\n",
       "      <td>78.6平米</td>\n",
       "      <td>南 北</td>\n",
       "      <td>精装</td>\n",
       "      <td>无电梯</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                    followinfo  \\\n",
       "0   19人关注/24次带看近地铁VR房源房本满五年随时看房458万单价48620元/平米   \n",
       "1     110人关注/50次带看VR房源房本满五年随时看房820万单价90829元/平米   \n",
       "2   46人关注/53次带看近地铁VR房源房本满五年随时看房591万单价46144元/平米   \n",
       "3      57人关注/54次带看VR房源房本满五年随时看房370万单价49605元/平米   \n",
       "4  134人关注/29次带看近地铁VR房源房本满五年随时看房425万单价54072元/平米   \n",
       "\n",
       "                            houseinfo totalprice   xiaoqu huxing    mianji  \\\n",
       "0       风雅园二区 /2室1厅/94.2平米/南 北/毛坯/无电梯        458   风雅园二区    2室1厅    94.2平米   \n",
       "1     雅世合金公寓 /2室1厅/90.28平米/南 北/精装/有电梯        820  雅世合金公寓    2室1厅   90.28平米   \n",
       "2  龙腾苑四区 /3室1厅/128.08平米/东南 南 北/精装/无电梯        591   龙腾苑四区    3室1厅  128.08平米   \n",
       "3         东泽园 /2室1厅/74.59平米/东南/简装/有电梯        370     东泽园    2室1厅   74.59平米   \n",
       "4         育芳园 /2室1厅/78.6平米/南 北/精装/无电梯        425     育芳园    2室1厅    78.6平米   \n",
       "\n",
       "  chaoyang zhuangxiu dianti  \n",
       "0      南 北        毛坯    无电梯  \n",
       "1      南 北        精装    有电梯  \n",
       "2   东南 南 北        精装    无电梯  \n",
       "3       东南        简装    有电梯  \n",
       "4      南 北        精装    无电梯  "
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "house.head()"
   ]
  },
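  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use the same approach to split the follower field and merge it back. The delimiter is again the slash. As the result below shows, the scraped text also carries listing tags and price text, so after the split the daikan column still contains extra text and fabu holds only the trailing 平米; these columns need further cleaning before use."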
   ]
  },
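  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because the follower text carries extra tags and price text, a regular expression can be a more targeted alternative to splitting. The sketch below is an assumption-laden illustration (`parse_follow_info` is a hypothetical name) that pulls out only the follower count and the viewing count.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "\n",
    "def parse_follow_info(text):\n",
    "    # Find the number right before each label; None if a label is missing.\n",
    "    followers = re.search(u'([0-9]+)人关注', text)\n",
    "    viewings = re.search(u'([0-9]+)次带看', text)\n",
    "    return (\n",
    "        int(followers.group(1)) if followers else None,\n",
    "        int(viewings.group(1)) if viewings else None,\n",
    "    )\n",
    "\n",
    "\n",
    "sample = u'19人关注/24次带看近地铁VR房源房本满五年随时看房458万单价48620元/平米'\n",
    "print(parse_follow_info(sample))  # (19, 24)\n",
    "```\n",
    "\n",
    "Returning integers rather than strings means the counts can be sorted or averaged directly."
   ]
  },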
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Split the follower information into columns\n",
    "followinfo_split = pd.DataFrame((x.split('/') for x in house.followinfo),index=house.index, columns=['guanzhu','daikan','fabu'])\n",
    "# Merge the split follower columns back into the main table\n",
    "house = pd.merge(house,followinfo_split,right_index=True, left_index=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>followinfo</th>\n",
       "      <th>houseinfo</th>\n",
       "      <th>totalprice</th>\n",
       "      <th>xiaoqu</th>\n",
       "      <th>huxing</th>\n",
       "      <th>mianji</th>\n",
       "      <th>chaoyang</th>\n",
       "      <th>zhuangxiu</th>\n",
       "      <th>dianti</th>\n",
       "      <th>guanzhu</th>\n",
       "      <th>daikan</th>\n",
       "      <th>fabu</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>19人关注/24次带看近地铁VR房源房本满五年随时看房458万单价48620元/平米</td>\n",
       "      <td>风雅园二区 /2室1厅/94.2平米/南 北/毛坯/无电梯</td>\n",
       "      <td>458</td>\n",
       "      <td>风雅园二区</td>\n",
       "      <td>2室1厅</td>\n",
       "      <td>94.2平米</td>\n",
       "      <td>南 北</td>\n",
       "      <td>毛坯</td>\n",
       "      <td>无电梯</td>\n",
       "      <td>19人关注</td>\n",
       "      <td>24次带看近地铁VR房源房本满五年随时看房458万单价48620元</td>\n",
       "      <td>平米</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>110人关注/50次带看VR房源房本满五年随时看房820万单价90829元/平米</td>\n",
       "      <td>雅世合金公寓 /2室1厅/90.28平米/南 北/精装/有电梯</td>\n",
       "      <td>820</td>\n",
       "      <td>雅世合金公寓</td>\n",
       "      <td>2室1厅</td>\n",
       "      <td>90.28平米</td>\n",
       "      <td>南 北</td>\n",
       "      <td>精装</td>\n",
       "      <td>有电梯</td>\n",
       "      <td>110人关注</td>\n",
       "      <td>50次带看VR房源房本满五年随时看房820万单价90829元</td>\n",
       "      <td>平米</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>46人关注/53次带看近地铁VR房源房本满五年随时看房591万单价46144元/平米</td>\n",
       "      <td>龙腾苑四区 /3室1厅/128.08平米/东南 南 北/精装/无电梯</td>\n",
       "      <td>591</td>\n",
       "      <td>龙腾苑四区</td>\n",
       "      <td>3室1厅</td>\n",
       "      <td>128.08平米</td>\n",
       "      <td>东南 南 北</td>\n",
       "      <td>精装</td>\n",
       "      <td>无电梯</td>\n",
       "      <td>46人关注</td>\n",
       "      <td>53次带看近地铁VR房源房本满五年随时看房591万单价46144元</td>\n",
       "      <td>平米</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>57人关注/54次带看VR房源房本满五年随时看房370万单价49605元/平米</td>\n",
       "      <td>东泽园 /2室1厅/74.59平米/东南/简装/有电梯</td>\n",
       "      <td>370</td>\n",
       "      <td>东泽园</td>\n",
       "      <td>2室1厅</td>\n",
       "      <td>74.59平米</td>\n",
       "      <td>东南</td>\n",
       "      <td>简装</td>\n",
       "      <td>有电梯</td>\n",
       "      <td>57人关注</td>\n",
       "      <td>54次带看VR房源房本满五年随时看房370万单价49605元</td>\n",
       "      <td>平米</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>134人关注/29次带看近地铁VR房源房本满五年随时看房425万单价54072元/平米</td>\n",
       "      <td>育芳园 /2室1厅/78.6平米/南 北/精装/无电梯</td>\n",
       "      <td>425</td>\n",
       "      <td>育芳园</td>\n",
       "      <td>2室1厅</td>\n",
       "      <td>78.6平米</td>\n",
       "      <td>南 北</td>\n",
       "      <td>精装</td>\n",
       "      <td>无电梯</td>\n",
       "      <td>134人关注</td>\n",
       "      <td>29次带看近地铁VR房源房本满五年随时看房425万单价54072元</td>\n",
       "      <td>平米</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                    followinfo  \\\n",
       "0   19人关注/24次带看近地铁VR房源房本满五年随时看房458万单价48620元/平米   \n",
       "1     110人关注/50次带看VR房源房本满五年随时看房820万单价90829元/平米   \n",
       "2   46人关注/53次带看近地铁VR房源房本满五年随时看房591万单价46144元/平米   \n",
       "3      57人关注/54次带看VR房源房本满五年随时看房370万单价49605元/平米   \n",
       "4  134人关注/29次带看近地铁VR房源房本满五年随时看房425万单价54072元/平米   \n",
       "\n",
       "                            houseinfo totalprice   xiaoqu huxing    mianji  \\\n",
       "0       风雅园二区 /2室1厅/94.2平米/南 北/毛坯/无电梯        458   风雅园二区    2室1厅    94.2平米   \n",
       "1     雅世合金公寓 /2室1厅/90.28平米/南 北/精装/有电梯        820  雅世合金公寓    2室1厅   90.28平米   \n",
       "2  龙腾苑四区 /3室1厅/128.08平米/东南 南 北/精装/无电梯        591   龙腾苑四区    3室1厅  128.08平米   \n",
       "3         东泽园 /2室1厅/74.59平米/东南/简装/有电梯        370     东泽园    2室1厅   74.59平米   \n",
       "4         育芳园 /2室1厅/78.6平米/南 北/精装/无电梯        425     育芳园    2室1厅    78.6平米   \n",
       "\n",
       "  chaoyang zhuangxiu dianti guanzhu                             daikan fabu  \n",
       "0      南 北        毛坯    无电梯   19人关注  24次带看近地铁VR房源房本满五年随时看房458万单价48620元   平米  \n",
       "1      南 北        精装    有电梯  110人关注     50次带看VR房源房本满五年随时看房820万单价90829元   平米  \n",
       "2   东南 南 北        精装    无电梯   46人关注  53次带看近地铁VR房源房本满五年随时看房591万单价46144元   平米  \n",
       "3       东南        简装    有电梯   57人关注     54次带看VR房源房本满五年随时看房370万单价49605元   平米  \n",
       "4      南 北        精装    无电梯  134人关注  29次带看近地铁VR房源房本满五年随时看房425万单价54072元   平米  "
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "house.head()"
   ]
  },
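  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now have the data. You have already learned data analysis with pandas, so why not try it out yourself? Visualization will be covered in later lessons."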
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
